Text localization for image and video OCR

ABSTRACT

In accord with embodiments consistent with the present invention, a first action in recognizing text from image and video is to accurately locate the position of the text in the image and video. After that, the located and possibly low resolution text can be extracted, enhanced and binarized. Finally, existing OCR technology can be applied to the binarized text for recognition. This abstract is not to be considered limiting, since other embodiments may deviate from the features described in this abstract.

CROSS REFERENCE TO RELATED DOCUMENTS

This application is related to and claims priority benefit of U.S. Provisional Patent Application No. 61/190,992 filed Sep. 30, 2008 to Yu, et al., which is hereby incorporated herein by reference. This application is related to U.S. patent application Ser. No. 11/706,919 filed Feb. 14, 2007, Ser. No. 11/706,890 filed Feb. 14, 2007, Ser. No. 11/715,856 filed Mar. 8, 2007 and Ser. No. 11/706,529 filed Feb. 14, 2007, all to Candelore, which are hereby incorporated herein by reference.

COPYRIGHT AND TRADEMARK NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. Trademarks are the property of their respective owners.

BACKGROUND

In TV video, text is often present which can provide important content information such as the name of an advertised product, the URL (Uniform Resource Locator) of related information, the name of the speaker or the player, the location and date of an event, etc. The text, whether added artificially as closed captions or embedded in the scene, can be utilized to index and retrieve the image and video, analyze viewers' interest in video content, or provide the viewer related content that can be accessed from the Internet. However, text embedded in ordinary television or video images presents special problems in text identification and recognition that are not present when text recognition is carried out on conventional documents.

BRIEF DESCRIPTION OF THE DRAWINGS

Certain illustrative embodiments illustrating organization and method of operation, together with objects and advantages, may be best understood by reference to the detailed description that follows, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is an example flow chart of a text localization process consistent with certain embodiments of the present invention.

FIG. 2, which is made up of FIGS. 2A and 2B, is an example image before and after processing in a manner consistent with certain embodiments of the present invention.

FIG. 3 illustrates parameters used in merging groups of regions in an example implementation consistent with certain embodiments of the present invention.

FIG. 4 shows some of the extracted regions after preprocessing the segmented image of FIG. 2B in a manner consistent with certain embodiments of the present invention.

FIG. 5, which is made up of FIGS. 5A and 5B, shows stroke width parameters as used in a manner consistent with certain embodiments of the present invention.

FIG. 6, which is made up of FIGS. 6A through 6F, shows binarization results for several examples consistent with certain embodiments of the present invention.

REFERENCED DOCUMENTS

The following documents are referenced in the detailed description below:

-   [1] Rainer Lienhart, Video OCR: A survey and practitioner's guide, in Video Mining, Kluwer Academic Publishers, pp. 155-184, October 2003.
-   [2] Keechul Jung, Kwang In Kim and Anil K. Jain, Text information extraction in images and video: a survey, Pattern Recognition, Vol. 37, pp. 977-997, 2004.
-   [3] Jian Liang, David Doermann and Huiping Li, Camera-based analysis of text and documents: a survey, IJDAR, Vol. 7, No. 2-3, 2005.
-   [4] Anil K. Jain and Bin Yu, Automatic text location in images and video frames, Pattern Recognition, Vol. 31, No. 12, 1998.
-   [5] J. Ohya, A. Shio and S. Akamatsu, Recognizing characters in scene images, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 16, No. 2, 1994, pp. 214-220.
-   [6] C. M. Lee and A. Kankanhalli, Automatic extraction of characters in complex images, Int. J. Pattern Recognition Artif. Intell. 9(1), 1995, pp. 67-82.
-   [7] M. A. Smith and T. Kanade, Video skimming for quick browsing based on audio and image characterization, Technical Report CMU-CS-95-186, Carnegie Mellon University, July 1995.
-   [8] D. Chen, K. Shearer and H. Bourlard, Text enhancement with asymmetric filter for video OCR, Proceedings of the International Conference on Image Analysis and Processing, Palermo, Italy, 2001, pp. 192-197.
-   [9] H. Li, D. Doermann and O. Kia, Automatic text detection and tracking in digital video, IEEE Trans. Image Process. 9(1), 2001, pp. 147-156.
-   [10] D. Chen, H. Bourlard and J-P. Thiran, Text identification in complex background using SVM, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2001, pp. 621-626.
-   [11] Xiangrong Chen and Alan L. Yuille, Detecting and reading text in natural scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2004, pp. 366-373.
-   [12] Edward K. Wong and Minya Chen, A new robust algorithm for video text extraction, Pattern Recognition, No. 36, 2003, pp. 1398-1406.
-   [13] K. Subramanian, P. Natarajan, M. Decerbo and D. Castanon, Character-stroke detection for text-localization and extraction, Proceedings of IEEE Document Analysis and Recognition, Vol. 1, 2007, pp. 23-26.
-   [14] Richard Nock and Frank Nielsen, Statistical region merging, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 26, No. 11, 2004, pp. 1452-1458.
-   [15] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, 1998.
-   [16] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/˜cjlin/libsv
-   [17] W. Niblack, An Introduction to Digital Image Processing, pp. 115-116, Prentice Hall, 1986.
-   [18] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. on Systems, Man and Cybernetics, Vol. 9, No. 1, pp. 62-66, 1979.
-   [19] S. D. Yanowitz and A. M. Bruckstein, A new method for image segmentation, Computer Vision, Graphics and Image Processing (CVGIP), Vol. 46, No. 1, pp. 82-95, 1989.
-   [20] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong and R. Young, ICDAR 2003 robust reading competitions, in 7th International Conference on Document Analysis and Recognition (ICDAR 2003), 2003.
-   [21] S. M. Lucas, ICDAR 2005 text locating competition results, ICDAR 2005, pp. 80-84.

DETAILED DESCRIPTION

While this invention is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure of such embodiments is to be considered as an example of the principles and not intended to limit the invention to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings.

The terms “a” or “an”, as used herein, are defined as one or more than one. The term “plurality”, as used herein, is defined as two or more than two. The term “another”, as used herein, is defined as at least a second or more. The terms “including” and/or “having”, as used herein, are defined as comprising (i.e., open language). The term “coupled”, as used herein, is defined as connected, although not necessarily directly, and not necessarily mechanically. The term “program” or “computer program” or similar terms, as used herein, is defined as a sequence of instructions designed for execution on a computer system. A “program”, or “computer program”, may include a subroutine, a function, a procedure, an object method, an object implementation, an executable application, an applet, a servlet, a source code, an object code, a shared library/dynamic load library and/or other sequence of instructions designed for execution on a computer system. The term “program”, as used herein, may also be used in a second context (the above definition being for the first context). In the second context, the term is used in the sense of a “television program”. In this context, the term is used to mean any coherent sequence of audio video content such as those which would be interpreted as and reported in an electronic program guide (EPG) as a single television program, without regard for whether the content is a movie, sporting event, segment of a multi-part series, news broadcast, etc. The term may also be interpreted to encompass commercial spots and other program-like content which may not be reported as a program in an electronic program guide.

Reference throughout this document to “one embodiment”, “certain embodiments”, “an embodiment” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.

The term “or” as used herein is to be interpreted as an inclusive or, meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C”. An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive. Throughout the present document, various thresholds are used for comparisons in certain processes. The thresholds disclosed herein and by reference to the above reference materials are for reference in connection with the presently presented example embodiments and are not intended to be limiting on other processes consistent with other implementations.

In TV video, text is often present which can provide important content information such as the name of an advertised product, the URL (Uniform Resource Locator) of related information, the name of the speaker or the player, the location and date of an event, etc. The text, whether added artificially as closed captions or embedded in the scene, can be utilized to index and retrieve the image and video, analyze viewers' interest in video content, or provide the viewer related content that can be accessed from the Internet. However, text embedded in ordinary television or video images presents special problems in text identification and recognition that are not present when text recognition is carried out on conventional documents. Even with relatively high definition video, text is often presented in relatively low resolution that is more difficult to recognize by machine than is typical with printed text using conventional optical character recognition.

Optical character recognition (OCR) technology can be used to automatically recognize text from a text document where the resolution is high enough (e.g., more than 300 dpi) and the foreground text is preferably black on a simply structured white background. However, in image or video, the resolution is generally much lower (e.g., 50 dpi or even lower). The poor quality of the image also comes from noise due to the sensor, uneven lighting or compression, etc. In addition to that, there is distortion brought by perspective, wide-angle lenses, non-planar surfaces, illumination, etc. Finally, the text can be on a complex background with moving objects around. In short, there are many variables that conventional OCR technology does not account for when processing images such as video images or frames therefrom. All of these problems often make it difficult or even infeasible to directly apply OCR technology to image and video data.

In accord with embodiments consistent with the present invention, a first action in recognizing text from image and video is to accurately locate the position of the text in the image and video. This turns out to be a very complex problem. After that, the located and possibly low resolution text can be extracted, enhanced and binarized. Finally, existing OCR technology can be applied to the binarized text for recognition.

The problem of localizing text in image and video data has been addressed in a number of ways. Comprehensive reviews of text localization and extraction algorithms can be found in references [1], [2] and [3] above. Often the methods are classified as region-based, edge-based and texture-based methods.

In region-based methods, as described in references [4], [5] and [6], characters in text are assumed to have the same color. Regions are generated by connected component analysis, clustering or segmentation algorithms. Then heuristics such as the size, the height/width ratio of the region, or the baselines are employed to filter out non-text regions. Finally, the remaining regions are classified as text or non-text either by heuristic constraints or a trained classifier.

Edge-based methods, as described in references [7] and [8], are based on the observation that text exhibits strong edges against the background, and therefore text regions are clusters of edges, so the first step is edge detection. Then, by smoothing and merging, edges are clustered. Finally, those clusters are classified as text or non-text either by heuristic constraints or a trained classifier.

Texture-based methods make use of texture features to decide whether a pixel or a region belongs to text or not. The whole image is scanned pixel by pixel or block by block to extract texture features such as local spatial variance, horizontal variance, second order statistics, frequency features, local energy or high order moments of the wavelet transform, etc. The features are then fed into a classifier (neural network [9], support vector machine [10], or adaboosting [11]) to classify the pixel or the block as text or not. Finally, the pixels or blocks are merged to generate the final text area.

The technique described herein can be broadly characterized as a region-based text localization method. A fast and effective image segmentation algorithm is first utilized to extract regions of similar colors. After preprocessing, where heuristics are applied to filter out regions that are unlikely to be text, features of each region are analyzed. Based on the observation that strokes in text tend to have similar width, stroke features are extracted. In addition, important edge features and fill factor features are extracted. Finally, a support vector machine (SVM) classifier (a classifier that separates objects into different groupings) is trained to classify regions into text and non-text. An SVM is used to maximize the difference between text and non-text.

Stroke features are employed to assist in identifying text. It is noted that generally the widths of the strokes in text are similar both horizontally and vertically. In references [12] and [13], stroke features are also used; however, only the horizontal stroke widths are checked for text detection. Here, features of stroke widths in both the horizontal and vertical directions are extracted. In addition, edge and other important features are combined for classification.

I. System and Process Overview

FIG. 1 is an example flow chart of a video OCR process 100 consistent with certain embodiments starting at 104. This figure can also be viewed as a system diagram, with each block of the figure representing a functional block of the system that can be implemented using programmed processors and/or state machines and/or dedicated hardware. At 108, the system receives input images or key frames. Then the image is segmented into regions of similar colors at 112. If those regions are assigned a representative color, the resulting image resembles one that is rendered in a limited number of colors, and the image has a blocky appearance at the boundaries of the color regions. These regions are filtered by heuristic constraints such as size, aspect ratio, fill factor, etc. at preprocessing block 116. Features of the remaining regions are extracted at 120. These features are fed into an SVM classifier at 124, which classifies the regions into text and non-text regions. Those text regions are enhanced and binarized at 128. Finally, OCR processing is carried out in an OCR engine at 132 that may be implemented in the form of a programmed processor. The OCR engine acts on the binarized regions and outputs the recognition result in the form of recognized text, and the process ends at 136. The various elements of FIG. 1 are described in greater detail below.
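
The following Python sketch outlines one way the stages of FIG. 1 could be chained together in software. It is a minimal, non-limiting illustration; the helper functions (segment_by_color, is_unlikely_text, merge_similar_regions, extract_features, enhance_and_binarize) are hypothetical names standing in for the processing described in the sections that follow, not part of any particular library.

```python
# Minimal pipeline sketch corresponding to the blocks of FIG. 1.
# All helper functions are hypothetical placeholders for the steps
# described in sections II.A through II.E below.

def video_ocr_pipeline(frame, classifier, ocr_engine):
    # 112: segment the frame into regions of similar color
    regions = segment_by_color(frame)
    # 116: discard regions that are very unlikely to be text
    candidates = [r for r in regions if not is_unlikely_text(r)]
    # 116 (cont.): group regions of similar size/color that align horizontally
    groups = merge_similar_regions(candidates)
    recognized = []
    for group in groups:
        # 120: extract stroke, edge and fill factor features (9-dimensional vector)
        features = extract_features(group)
        # 124: SVM decides text vs. non-text (+1 means text)
        if classifier.predict([features])[0] == 1:
            # 128: enhance and binarize the detected text region
            binary = enhance_and_binarize(group)
            # 132: hand the binarized region to the OCR engine
            recognized.append(ocr_engine.recognize(binary))
    return recognized
```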

II. Text Localization by SVM

A. Segmentation

In accord with certain implementations, the statistical region merging algorithm described in reference [14] is applied to the input image to get regions of similar color, but other algorithms could also be used. For purposes of this document, the term “similar color” means, as used herein in one example implementation, that the absolute differences of the average red, green, blue (R, G, B) values of two regions (where one region is designated by the prime symbol, and where the overscore represents an average value) are within a merging threshold, which can be formulated as follows:

$\left( {\overline{R} - \overline{R'}} \right)^{2} < T_{dev},\quad \left( {\overline{G} - \overline{G'}} \right)^{2} < T_{dev},\quad \left( {\overline{B} - \overline{B'}} \right)^{2} < T_{dev},$

where T_(dev) is a merging threshold such as those provided in reference [14]. Other merging thresholds and other definitions of similar color may also be appropriate in other implementations. Unlike most of the other known segmentation algorithms, which use more or less restrictive assumptions on distributions, this algorithm is currently preferred because it is based on an image generation model with few assumptions, which makes the algorithm effective in all kinds of scenarios. The algorithm is carried out in three phases. The first phase is to calculate the color difference of neighboring pixels. The second phase involves sorting the pixels according to their color difference. The third phase involves merging pixels with color difference smaller than a threshold so that regions are generated. It has been established that the algorithm only suffers over-merging error, and achieves with high probability a low error in segmentation. Finally, the algorithm can be efficiently approximated in linear time/space, leading to a fast segmentation algorithm. FIG. 2 of the above referenced Provisional Patent Application 61/190,992 shows an example output of the segmentation algorithm, and is reproduced here as FIG. 2A and FIG. 2B.
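
A minimal sketch of the similar-color test and the three phases described above is shown below. It is not the full statistical region merging algorithm of reference [14]; the merging threshold t_dev is treated here as a plain parameter, whereas [14] derives it from a statistical model, and the pairwise color difference is a simplified stand-in for the merging predicate.

```python
import numpy as np

def similar_color(region_a, region_b, t_dev):
    """Return True if the squared differences of the average R, G, B values
    of two regions (arrays of pixels, shape (N, 3)) are each below t_dev."""
    mean_a = region_a.reshape(-1, 3).mean(axis=0)
    mean_b = region_b.reshape(-1, 3).mean(axis=0)
    return bool(np.all((mean_a - mean_b) ** 2 < t_dev))

def merge_regions(image, t_dev):
    """Simplified three-phase merging over 4-connected pixel pairs."""
    h, w, _ = image.shape
    labels = np.arange(h * w)                      # each pixel starts as its own region

    def find(i):                                   # union-find with path halving
        while labels[i] != i:
            labels[i] = labels[labels[i]]
            i = labels[i]
        return i

    # Phase 1: color difference of neighboring pixels (right and down neighbors).
    img = image.astype(np.float32)
    pairs = []
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    diff = np.max(np.abs(img[y, x] - img[ny, nx]))
                    pairs.append((diff, y * w + x, ny * w + nx))
    # Phase 2: sort pixel pairs by color difference.
    pairs.sort(key=lambda p: p[0])
    # Phase 3: merge pairs whose color difference is below the threshold.
    for diff, i, j in pairs:
        if diff ** 2 < t_dev:
            labels[find(i)] = find(j)
    return np.array([find(i) for i in range(h * w)]).reshape(h, w)
```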

B. Preprocessing

After segmentation, regions of similar colors are obtained. The goal is to classify those regions into text and non-text regions. To improve the efficiency of the classification, those regions that are very unlikely to be text are first removed. So the following conditions are checked in one implementation (a code sketch of this filter follows the list of threshold values below):

-   (1) If region_height is smaller than some threshold T_low, or the region_height is larger than some threshold T_high, the region is discarded;
-   (2) If region_area is smaller than some threshold T_area, the region is discarded;
-   (3) If the region touches one of the four sides of the image border, and its height is larger than a threshold T, the region is discarded;
-   (4) If a fill_factor defined as

$\begin{matrix}{{{fill\_ factor} = \frac{{Region}\mspace{14mu} {Area}}{{Bounding}\mspace{14mu} {Box}\mspace{14mu} {Area}}},} & (1)\end{matrix}$

is lower than a threshold T_fill, the region is discarded.

-   The above thresholds are selected empirically. In this example implementation, the values that were used are as follows:
-   T_low=10
-   T_high=HEIGHT*0.9 (HEIGHT is the height of the image)
-   T_area=12
-   T=HEIGHT/2
-   T_fill=0.1
-   Other values may be suitable for other implementations and the present values may be further optimized empirically.
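
As a rough illustration, the heuristic filter above could be expressed as in the following sketch. The Region structure and its fields are hypothetical, and the thresholds are simply the empirical values listed above.

```python
def is_unlikely_text(region, image_height):
    """Heuristic pre-filter: return True if the region should be discarded
    before feature extraction (conditions (1) through (4) above)."""
    t_low, t_area, t_fill = 10, 12, 0.1
    t_high = 0.9 * image_height
    t_border = image_height / 2

    # (1) too short or too tall
    if region.height < t_low or region.height > t_high:
        return True
    # (2) too small an area
    if region.area < t_area:
        return True
    # (3) touches the image border and is tall
    if region.touches_border and region.height > t_border:
        return True
    # (4) fill factor of the bounding box too low
    fill_factor = region.area / float(region.bbox_width * region.bbox_height)
    return fill_factor < t_fill
```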

Characters tend to appear in clusters, and it is much easier to classify clusters of characters. Characters in the same word usually have the same color, and most of the time they are horizontally aligned. Due to the above facts, regions are grouped if their size and color are similar and their horizontal positions are within a threshold.

FIG. 3 depicts the parameters D_(region), D_(top) and D_(bottom) used in merging or grouping regions in the present example implementation.

-   The merging or grouping rules used in an example implementation are as follows (a code sketch of these rules follows the threshold values below):

${{rule}\mspace{14mu} 1.\mspace{14mu} \left( {{height}\mspace{14mu} {similarity}} \right)\frac{\max \left( {{HEIGHT}_{1},{HEIGHT}_{2}} \right)}{\min \left( {{HEIGHT}_{1},{HEIGHT}_{2}} \right)}} < T_{height\_ sim}$

where HEIGHT₁ and HEIGHT₂ are the heights of the two regions.

${{rule}\mspace{14mu} 2.\mspace{14mu} \left( {{color}\mspace{14mu} {similarity}} \right){D\left( {c_{1},c_{2}} \right)}} = {\sqrt{\left( {\overset{\_}{R_{1}} - \overset{\_}{R_{2}}} \right)^{2} + \left( {\overset{\_}{G_{1}} - \overset{\_}{G_{2}}} \right)^{2} + \left( {\overset{\_}{B_{1}} - \overset{\_}{B_{2}}} \right)^{2}} < T_{color}}$

where c₁=[ R₁ G₁ B₁ ] and c₂=[ R₂ G₂ B₂ ] are the average colors of the two regions.

-   rule 3. (region distance) D_(region)<T_(region), where D_(region) is the horizontal distance of the two regions.
-   rule 4. (horizontal alignment) D_(top)<T_(align) or D_(bottom)<T_(align), where D_(top) and D_(bottom) are the vertical distances between the top boundaries and the bottom boundaries of the two regions. Refer to FIG. 3 for the definition of D_(region), D_(top) and D_(bottom). The thresholds are set empirically as follows, but other settings may be suitable for other implementations, rules and rule modifications, and these thresholds may be further optimized:

T_(height_sim) = 2.5

T_(color) = 80

T_(region) = HEIGHT₁ + HEIGHT₂

$T_{align} = {\max \left( {1,{0.4 \cdot \frac{{HEIGHT}_{1} + {HEIGHT}_{2}}{2}}} \right)}$
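
A compact sketch of the four grouping rules, using the empirical thresholds above, might look like the following; the Region fields (height, mean_color, left, right, top, bottom) are hypothetical names for the quantities shown in FIG. 3.

```python
import math

def should_group(a, b):
    """Return True if two regions satisfy rules 1 through 4 and may be grouped."""
    t_height_sim, t_color = 2.5, 80
    t_region = a.height + b.height
    t_align = max(1, 0.4 * (a.height + b.height) / 2)

    # rule 1: height similarity
    if max(a.height, b.height) / min(a.height, b.height) >= t_height_sim:
        return False
    # rule 2: color similarity (Euclidean distance of average R, G, B)
    if math.dist(a.mean_color, b.mean_color) >= t_color:
        return False
    # rule 3: horizontal distance between the regions (0 if they overlap)
    d_region = max(0, max(a.left, b.left) - min(a.right, b.right))
    if d_region >= t_region:
        return False
    # rule 4: horizontal alignment of top or bottom boundaries
    d_top = abs(a.top - b.top)
    d_bottom = abs(a.bottom - b.bottom)
    return d_top < t_align or d_bottom < t_align
```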

FIG. 4 shows some of the extracted regions after preprocessing the segmented image in FIG. 2 of the provisional patent application (reproduced here as FIG. 2B). In FIG. 4, the boxes depict the boundaries of the bounding box enclosing each of the example regions, the black areas are the foreground regions and the white areas are the background regions. The next task is to classify those regions into text and non-text regions.

C. Feature Extraction

The features of the remaining regions are then extracted. The features used are stroke width features, edge features and fill factor features, which are elaborated as follows.

Stroke Width Feature

FIG. 5, which is made up of FIGS. 5A and 5B, illustrates the concept of stroke width. For purposes of this document, stroke width is considered the width in pixels in the horizontal direction between two edges of a stroke. The actual width is not particularly important in the present method, which uses the percentage of neighborhoods whose variance in stroke width is within a threshold.

The stroke width feature is based on the observation that the stroke width within a text element tends to be similar in both the vertical direction and the horizontal direction, which is illustrated in FIG. 5. FIG. 5A shows that the width of an example stroke in the letter “F” is approximately constant in a vertical neighborhood defined by the bracket 140. The arrows show the width in this area of the character F. FIG. 5B shows that horizontally the strokes have similar widths (i.e., approximately equal) or can be clustered into groups with similar width. In this figure, each of the reference numbers 144, 148, 152 and 156 depicts approximately common widths.

The term “neighborhood” as used in the present context is a range of vertical distance that contains a stroke; when the stroke width of a text element is said to be similar in the vertical direction, this means that the width is approximately constant within a vertical neighborhood. For the horizontal direction, the stroke widths are compared in the same row, i.e., at the same y coordinates.

The feature that reflects an approximately constant vertical stroke width is calculated as follows. First, the standard deviation of the stroke widths in a vertical neighborhood is calculated. A vertical neighborhood is defined, as used herein, as the pixels with coordinates (x, y) where x=c, r≦y≦r+T_(n), for every pixel (c, r) within the region, (c, r) ε [ROW_(region), HEIGHT_(region)]. The feature value s₁ is the percentage of the neighborhoods in the whole region whose standard deviation of stroke width is within a threshold.

Similarly, the feature of horizontal stroke width is also calculated. The stroke widths in a row are calculated and clustered, where a “row” is defined as the pixels with the same y coordinate. Clusters whose member number is less than three in this example are considered noisy or outliers and are excluded from consideration, where the member number is the number of members in the cluster, that is, the number of strokes with similar stroke widths, because the clusters are obtained according to similarities of the stroke widths. In this way, outliers are excluded, where an outlier is defined as a cluster with few members (here, fewer than three members, that is, stroke clusters with fewer than three strokes having similar stroke widths). Another reason for clustering is that there may be different strokes in a row. For example, in the upper row of FIG. 5B, there are three clusters of different stroke widths labeled 148, 152 and 156. The feature value s₂ that reflects constant horizontal stroke width is the percentage of the rows whose standard deviation of horizontal stroke width is within a threshold, or which can be clustered into groups where the standard deviation of horizontal stroke width in each group is within a threshold.

Based on the observation that there is some distance between strokes of text, feature value s₃ is extracted as the average ratio of the current stroke width and the distance of the current stroke to the next neighboring stroke.

The last stroke feature, s₄, is the ratio of the two stroke widths that appear the most often.

The following is an example of pseudo-code for a process used in an example implementation for extracting stroke width features:

Pseudo-code of extracting stroke width features s1, s2, s3, s4

Feature s₁: measure of constant vertical stroke width: s1=VerticalConstStrokeWidth(img)

Input: img, the binary image to be classified as text or non-text; the foreground is black and the background is white, that is, img(foreground)=0 and img(background)=1; the number of rows in the image is HEIGHT and the number of columns in the image is WIDTH.
Output: s1, the feature value that measures constant vertical stroke width.

1. For each pixel (x,y) in img, calculate the stroke width array StrokeWidthMap:
   a. For pixels (x,y) in the background, the stroke width is 0: StrokeWidthMap(x,y)=0;
   b. For pixels in the foreground, the stroke width is the distance between the edges of the current stroke. For example, in figure A, the pixels in the red line all have a stroke width of 30−10=20, that is, StrokeWidthMap(10:30, 60)=20 (note: 10:30 means coordinates from 10 to 30).
2. An array StrokeWidthMap of the stroke width for each pixel (x, y) is obtained (note: StrokeWidthMap has the same dimension as img).
3. For (x, y ε [r, r+Tn]), in other words, for each column x and for each neighborhood of Tn rows, where Tn is defined as Tn=max(2, ceil(HEIGHT/10)):
   a. Calculate the median of the stroke width: medianW=median(StrokeWidthMap(x, r:r+Tn));
   b. Calculate the standard deviation of the stroke width: stdW=std(StrokeWidthMap(x, r:r+Tn));
   c. If the following conditions are satisfied: medianW<WIDTH/3 (the median stroke width is not too large) and stdW<medianW*0.5 (the standard deviation is small), then this neighborhood has constant vertical stroke width, so constStrokeNum=constStrokeNum+1.
4. The feature s₁ is the ratio of neighborhoods that have constant vertical stroke width: s1=constStrokeNum/total, where total is the number of neighborhoods that have strokes.

Feature s₂: measure of constant horizontal stroke width: s2=HorizontalConstStrokeWidth(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: s2, the feature value that measures constant horizontal stroke width.

1. For each row y in img, calculate the stroke widths for the current row and get an array StrokeWidth (StrokeWidth has the same number of rows as img, and each row holds the stroke widths for the strokes in the current row);
2. For each row y in StrokeWidth:
   a. Calculate the median of StrokeWidth: medianW=median(StrokeWidth(y));
   b. Calculate the standard deviation of StrokeWidth: stdW=std(StrokeWidth(y));
   c. If the ratio of the standard deviation and the median of the stroke width is less than a threshold, that is, stdW/medianW<WidthStdT (threshold WidthStdT=0.5), then it is counted as a row with horizontal constant stroke width, that is, constStrokeNum=constStrokeNum+1;
   d. Otherwise, cluster StrokeWidth(y). If any one of the clusters has more than 3 members (not outliers) and its median and standard deviation satisfy stdW/medianW<WidthStdT, then it is counted as a row with horizontal constant stroke width, that is, constStrokeNum=constStrokeNum+1.
3. Feature s2 is s2=constStrokeNum/total, where total is the number of rows in the image that have more than one stroke.

Feature s₃: ratio of the distance of the current stroke to the next neighboring stroke and the stroke width: s3=RatioStrokeDistWidth(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: s3, the ratio of the distance of the current stroke to the next neighboring stroke and the stroke width.

1. Calculate the stroke width array StrokeWidth (the same one as in extracting feature s2);
2. Calculate the distance of the current stroke to the next neighboring stroke, StrokeDist;
3. Calculate the ratio: ratio=StrokeDist/StrokeWidth;
4. Put ratio in the array StrokeDistWidthRatio;
5. Feature s3=median(StrokeDistWidthRatio).

Feature s₄: ratio of the most common stroke widths: s4=RatioMostStrokeWidth(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: s4, the ratio of the two most common stroke widths.

1. Calculate the histogram H of the stroke width array StrokeWidth: [H, Xw]=hist(StrokeWidth, 10), where 10 is the bin number for calculating the histogram, H is the histogram or the frequency of each bin, and Xw is the bin location;
2. Sort the histogram: [sH, sI]=sort(H), where sH is the sorted histogram and sI is the index, that is, sH=H(sI);
3. If sH(1)/sum(sH)==1 (there is only one stroke width), s4=0;
4. Otherwise, s4=Xw(sI(1))/Xw(sI(2)), where sI(1) and sI(2) are the indices of the two most common stroke widths.

In each case above, text widths are measured in pixels, but other measurement increments may be possible.

Edge Features

A set of edge features (e₁, e₂, e₃) is also used in the example implementation. In text, there are a lot of edges, so edge features can be used to assist in text localization. The first edge feature e₁ is based on the fact that text characters generally have smooth edges. The feature value e₁ is calculated as the ratio of 5×5 neighborhoods that have the same edge direction, i.e., blocks having a smooth direction. Edge feature e₂ is based on the observation that text characters usually have edges of all directions. The feature value e₂ is the frequency of the edge direction that appears the most often. If the frequency is very large, then quite probably the region is not text, because usually text characters have edges of all directions; that is why it is chosen as a feature to differentiate text and non-text regions. Finally, feature e₃ is the ratio of the length of the total edges to the area of the region, which characterizes the amount of edges that text has.

The following is the pseudo-code used for extracting edge features in an example implementation:

Pseudo-code of extracting edge features e1, e2, e3

Feature e₁: edge smoothness: e1=EdgeSmoothness(img)

Input: img, the binary image to be classified as text or non-text; the foreground is black and the background is white, that is, img(foreground)=0 and img(background)=1; the number of rows in the image is HEIGHT and the number of columns in the image is WIDTH.
Output: e1, the feature that measures the smoothness of edges.

1. Edge extraction: extract the edges of 8 directions (0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 7π/4) using Sobel edge detection: Edge=SobelEdge(img), where Edge has the same dimension as img; at the location of an edge it has a value of 1 to 8 depending on the direction of the edge, and at the location of a non-edge it has the value 0;
2. For each (x, y) that satisfies Edge(x, y) ≠ 0:
   a. Define the neighborhood: neighborhood=Edge([x−w : x+w], [y−w : y+w]), where w=1 if img's height is less than 25, and w=2 otherwise;
   b. Get the current direction: curDir=Edge(x, y);
   c. Get the number of pixels in the neighborhood that have the current direction, curDirNum;
   d. Get the number of edge pixels in the neighborhood, that is, neighborEdgeNum=Length(neighborhood ≠ 0);
   e. Calculate the ratio of the edge pixels with the same direction: R(x, y)=curDirNum/neighborEdgeNum;
3. Calculate the edge smoothness feature e1=length(R>T)/length(Edge ≠ 0).

Feature e₂: uniformity of edge direction: e2=EdgeUniformity(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: e2, the feature that measures the uniformity of edges.

1. Quantize the 8 edge directions extracted in step 1 of feature e1 into 4 directions, Edge4; that is, (5π/4, 3π/2, 7π/4) in Edge become (π/4, π/2, 3π/4) in Edge4;
2. Calculate the histogram of the 4 directions: H=hist(Edge4(Edge4 ≠ 0));
3. Calculate the maximum of H: maxH=max(H), so maxH is the maximal number of times a direction appears;
4. Calculate the edge uniformity feature e2=maxH/sum(H).

Feature e₃: amount of edges: e3=EdgeAmount(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: e3, the feature that measures the amount of edges.

1. Edge extraction: extract the edges of 8 directions (0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 7π/4) using Sobel edge detection, Edge=SobelEdge(img), as in feature e1;
2. Calculate the length of the edges: EdgeLength=length(Edge ≠ 0);
3. Calculate the foreground area of img: ForeArea=length(img(foreground));
4. Calculate the fill factor AreaFill: AreaFill=ForeArea/(WIDTH*HEIGHT);
5. Calculate the feature e3=EdgeLength/AreaFill.

Fill Factor Features

A set of fill factor features (f₁, f₂) is also used in this example implementation. This group of features is based on the fact that the foreground of text only partially fills its bounding box: it neither fills the whole bounding box nor fills only a very small part of it. Also, in a small neighborhood, the foreground does not fill the whole neighborhood.

The first fill factor feature f₁ describes the filling feature of the whole region, so it is calculated as the ratio of the foreground area to the area of the bounding box of the region. The second fill factor feature f₂ describes the filling feature of local neighborhoods. The ratio of the foreground area in each neighborhood is first calculated. The feature value f₂ is then the percentage of the neighborhoods in which the ratio of the foreground area is greater than a threshold.

The following is the pseudo-code for extracting fill factor features as used in an example implementation:

Pseudo-code of extracting fill factor features f1, f2

Feature f₁: filling feature of the whole region: f1=FillFactorWhole(img)

Input: img, the binary image to be classified as text or non-text; the foreground is black and the background is white, that is, img(foreground)=0 and img(background)=1; the number of rows in the image is HEIGHT and the number of columns in the image is WIDTH.
Output: f1, the feature that measures the fill factor of the candidate image.

1. Calculate the foreground area of img: ForeArea=length(img(foreground));
2. Calculate the whole area of img: WholeArea=WIDTH×HEIGHT;
3. Calculate the feature f1=ForeArea/WholeArea.

Feature f₂: filling feature of local neighborhoods: f2=FillFactorNeighborhood(img)

Input: img, the binary image to be classified as text or non-text (as above).
Output: f2, the feature that measures the fill factor in local neighborhoods of the candidate image.

1. For (x, y), where x and y increase by stepSize=HEIGHT/3:
   a. Get the current neighborhood: curN=img(x:x+stepSize, y:y+stepSize);
   b. Calculate the area of the foreground in the current neighborhood: AreaN=length(curN(foreground));
   c. Calculate the fill factor of the neighborhood: FillFactorN(j)=AreaN/Area(curN), where j is the index of the current neighborhood;
2. Get the number of neighborhoods that have a large fill factor: N=length(FillFactorN>T);
3. The feature f2 is the percentage of those blocks that have a large fill factor: f2=N/length(FillFactorN).

D. SVM-Based Classification

SVM, as described in reference [15], is a technique motivated by statistical learning theory and has been successfully applied to numerous classification tasks. The key idea is to separate two classes with a decision surface that has maximum margin. It minimizes a bound on the generalization error of a model in the high dimensional space rather than a training error. In SVM, the learning task is insensitive to the relative numbers of training examples in the positive and negative classes (in the detection task here, the negative class has many more samples than the positive class). Therefore, SVM is chosen as the preferred classifier for this example implementation.

The classification problem is a binary classification problem with m labeled training samples: (x₁, y₁), (x₂, y₂), . . . , (x_(m), y_(m)), where x_(i)=[s₁ ^(i), s₂ ^(i), s₃ ^(i), s₄ ^(i), e₁ ^(i), e₂ ^(i), e₃ ^(i), f₁ ^(i), f₂ ^(i)] is a 9-dimensional feature vector with each component as defined in section C, and y_(i)=±1 indicates the positive (text) and negative (non-text) classes (i=1, 2, . . . , m). SVM tries to solve the following problem:

$\begin{matrix}{{\min\limits_{w,b,\xi_{i}}{\frac{1}{2}w^{T}w}} + {C{\sum\limits_{i = 1}^{l}\xi_{i}}}} & (2)\end{matrix}$

subject to $y_{i}\left( {w^{T}{\phi\left( x_{i} \right)} + b} \right) \geq {1 - \xi_{i}}$   (3)

Its dual is

$\begin{matrix}{{\min\limits_{\alpha}{\frac{1}{2}\alpha^{T}Q\;\alpha}} - {e^{T}\alpha}} & (4)\end{matrix}$

subject to $y^{T}\alpha = 0,\quad 0 \leq \alpha_{i} \leq C,\; i = 1,\ldots,l$   (5)

where e is the vector of all ones, C>0 is the upper bound and is decided by cross validation, Q is an l by l semi-definite matrix with Q_(ij)≡y_(i)y_(j)K(x_(i),x_(j)), and K(x_(i),x_(j))≡φ(x_(i))^(T)φ(x_(j)) is the kernel. Here w, α and b are the parameters that decide the separating plane and are solved for by the optimization process. By choosing a nonlinear kernel function, the feature vectors x_(i) can be mapped into a higher dimensional space by the function φ. The kernel used here is the radial basis function kernel

$\begin{matrix}{{K\left( {X,X_{j}} \right)} = {\exp \left\{ \frac{- {{X - X_{j}}}^{2}}{2\sigma^{2}} \right\}}} & (6)\end{matrix}$

where the kernel bandwidth σ is determined through cross validation. Once the parameters w, α and b are decided, the following decision function can be used to classify the regions:

$\begin{matrix}{{{sgn}\left( {{\sum\limits_{i = 1}^{l}{y_{i}\alpha_{i}{K\left( {x_{i},x} \right)}}} + b} \right)}.} & (7)\end{matrix}$

The SVM was trained on a set of samples labeled as text or non-text, using the software package named LIBSVM [16]. Cross validation is used to decide the kernel bandwidth σ and C. The training and testing results are reported in the next section.
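
For illustration only, training such a classifier with an off-the-shelf SVM implementation could look like the sketch below. It uses scikit-learn (whose SVC class wraps LIBSVM) rather than LIBSVM's own interface, which is an assumed substitution; note that scikit-learn parameterizes the RBF kernel as exp(−γ‖x−x_j‖²), so γ corresponds to 1/(2σ²) in equation (6).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def train_text_classifier(features, labels):
    """features: (m, 9) array of [s1..s4, e1..e3, f1, f2] vectors,
    labels: array of +1 (text) / -1 (non-text)."""
    # Candidate C and gamma values; gamma = 1 / (2 * sigma^2).
    param_grid = {
        "C": [1, 4, 16, 32, 64],
        "gamma": [1 / (2 * s ** 2) for s in (0.25, 0.5, 1.0, 2.0)],
    }
    # Cross validation selects C and the kernel bandwidth, as described above.
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(np.asarray(features), np.asarray(labels))
    return search.best_estimator_

# Usage: clf = train_text_classifier(X_train, y_train)
#        predictions = clf.predict(X_test)   # +1 for text, -1 for non-text
```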

E. Enhancement and Binarization

After text regions have been identified, they should be enhanced and binarized so that OCR software can recognize the text easily. Most OCR software can only recognize text with large enough resolution, so if the height of the text is less than about 75 pixels (currently), scaling up may be needed. Before the scaling up, some enhancements can be applied such as histogram equalization, sharpening, etc.
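
A minimal sketch of such enhancement, assuming OpenCV and an 8-bit grayscale text region, is shown below; the 75-pixel height threshold and the choice of histogram equalization followed by a simple unsharp mask follow the description above, while the specific scaling factor and blur parameters are illustrative assumptions.

```python
import cv2

def enhance_text_region(gray, min_height=75):
    """Enhance a detected (8-bit grayscale) text region before binarization."""
    # Histogram equalization to improve contrast.
    enhanced = cv2.equalizeHist(gray)
    # Simple unsharp-mask sharpening.
    blurred = cv2.GaussianBlur(enhanced, (0, 0), sigmaX=1.0)
    enhanced = cv2.addWeighted(enhanced, 1.5, blurred, -0.5, 0)
    # Scale up if the text is shorter than about min_height pixels.
    h = enhanced.shape[0]
    if h < min_height:
        scale = min_height / float(h)
        enhanced = cv2.resize(enhanced, None, fx=scale, fy=scale,
                              interpolation=cv2.INTER_CUBIC)
    return enhanced
```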

Binarization is then applied to the enhanced image. There are different kinds of binarization algorithms such as Niblack's adaptive binarization algorithm [17], Otsu's method [18], and Yanowitz-Bruckstein's method [19], etc. Among those methods, Niblack's method and Otsu's method are widely used, but other binarization methods can be adapted to implementations consistent with the present invention. In Niblack's method, a threshold T is adaptively determined for each pixel from the intensity statistics within a local window of size r:

$\begin{matrix}{{T_{r}(x)} = {{\mu_{r}(x)} + {k\,{\sigma_{r}(x)}}}} & (8)\end{matrix}$

where μ and σ are the mean and standard deviation of the pixel intensities within the window. The scalar parameter k is a weight which is set to be −0.2. The window size r can be a fixed value or adaptively chosen. In [11], it is proposed that the window size r is chosen as

$\begin{matrix}{{r(x)} = {\min\limits_{r}\left( {{\sigma_{r}(x)} > T_{\sigma}} \right)}} & (9)\end{matrix}$

where T_(σ) is a fixed threshold. The value of T_(σ) is selected so that windows with standard deviations less than T_(σ) are smooth areas. Here, T_(σ) is set as the standard deviation of the background area of the detected text region. In Otsu's method [18], the binarization threshold is found by a discriminant criterion, that is, maximizing the between-class variance and minimizing the within-class variance. Otsu's method can be applied to the whole text region, a fixed-size window or an adaptive window as in (9). Due to the fact that in images the background is complex, if binarization is applied on the whole image, non-text objects in the background may also appear in the final binarized image. To avoid that, binarization is also applied to the connected components in the detected text region.
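
As one hedged illustration of equation (8), Niblack's per-pixel threshold with a fixed window could be computed as in the following sketch; the window size is a plain parameter here, and the adaptive selection of r from equation (9) is not implemented.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=15, k=-0.2):
    """Binarize a grayscale image with Niblack's rule T = mu + k * sigma,
    computed over a local window around each pixel (equation (8))."""
    img = gray.astype(np.float64)
    mean = uniform_filter(img, size=window)
    mean_sq = uniform_filter(img * img, size=window)
    sigma = np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))
    threshold = mean + k * sigma
    # Foreground (text) is darker than the local threshold: map it to 0,
    # background to 1, matching the convention used in the pseudo-code above.
    return (img >= threshold).astype(np.uint8)
```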

FIG. 6, which is made up of FIGS. 6A-6F, shows the binarization results when Otsu's method and Niblack's method are applied to the individual components, a fixed window, an adaptive window and the whole bounding box. The text detected is from the example illustrated at the bottom of FIG. 2. FIG. 6A shows Otsu's binarization over each connected component in the detected region. FIG. 6B shows Niblack's binarization in an adaptive window. FIG. 6C shows Otsu's binarization in an adaptive window. FIG. 6D shows Otsu's binarization in the whole bounding box. FIG. 6E shows Niblack's binarization in a fixed window. FIG. 6F shows Otsu's binarization in a fixed window. From FIG. 6, it can be seen that the performance of different binarization methods differs; it shows that Otsu's method applied in the whole bounding box is the best in this example. Those methods were tried on different images, and it was determined that no single method can give the best results on all the images. So, in practice, one possible solution is to feed the results of different binarization methods into the OCR software and then combine the recognition results.

III. Experiments and Results

The present algorithms were tested on two sets of data. One is ICDAR 2003's text localization competition data set [20]. In ICDAR 2003's data set, there are 248 images in the training set and 251 images in the test set. In each set there are about 1000 segments of text. Most of the images in the data set were taken outside with a handheld device. The other data set was collected from TV programs, including images from news, commercial ads, sports games, etc. There are 489 images in that data set with nearly 5000 segments of text.

The algorithm was first applied on the ICDAR 2003 data set. The images were first segmented and preprocessed. In the training data set, after the segmentation and preprocessing, there are 841 segments of text regions and 3373 segments of non-text regions. It was noticed that the number of text segments is less than the ground truth. This is partly due to segmentation error, where some text segments are not correctly segmented, and partly due to preprocessing, where some text segments are merged together. SVM was trained on the processed data set. Cross validation was used to select the parameters of the SVM. The optimal bandwidth σ in (6) is 0.5 and the parameter C in (2) is 32. The trained SVM model was applied on the test data set. A correct detection rate of 90.02% was obtained on the testing text samples, with a false positive rate of 6.45%. To compare with other text localization algorithms, the precision and recall measure [21] was used for measuring text localization performance. Table 1 summarizes the performance of the present algorithm and the performances of the winners in ICDAR 2003 and ICDAR 2005. The present algorithm ranks number three. It is believed that with careful tuning of the parameters in the present algorithm, the performance can be further improved. In Table 1, f is defined as

$f = \frac{1}{{\alpha/p} + {\left( {1 - \alpha} \right)/r}}$

where p is precision and r is recall. Refer to [20] and [21] for detailed definitions of precision and recall.

TABLE 1. Comparison with winners of ICDAR2003 [20] and ICDAR2005 [21].

| System | Precision | Recall | f |
| --- | --- | --- | --- |
| Hinnerk Becker | 0.62 | 0.67 | 0.62 |
| Alex Chen | 0.60 | 0.60 | 0.58 |
| Our Algorithm | 0.58 | 0.45 | 0.51 |
| Ashida | 0.55 | 0.46 | 0.50 |
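
As a quick check of the f measure, assuming the weighting α=0.5 (giving precision and recall equal weight) and taking the present algorithm's precision p=0.58 and recall r=0.45 from Table 1:

$f = \frac{1}{{0.5/0.58} + {0.5/0.45}} \approx \frac{1}{0.862 + 1.111} \approx 0.51,$

which matches the tabulated value.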

Next, the algorithm was applied to the TV data set. The data set was split into two sets, one for training (with 245 images) and the other for testing (with 244 images). After segmentation and preprocessing, there are 1100 segments of text regions and 7200 segments of non-text regions in the training set. SVM was trained on the training set. Cross validation was used to select the parameters of the SVM. The optimal bandwidth σ in (6) is 1 and the parameter C in (2) is 16. The trained SVM model was applied on the test data set, where there are about 850 segments of text regions and 6500 segments of non-text regions. The detection rate of text was 88.29%, with a false positive rate of 9.34%. FIG. 7 of the above provisional patent application shows an example detection result for an image in the TV data set.

The detected text regions were enhanced and binarized as described above. Then the binarized images were fed into OCR software for recognition. For example, the binarized images in FIG. 6 were fed into Scansoft's Omnipage™ Pro 12 for recognition. The recognized results are listed in Table 2 below. Table 2 shows that the OCR software can recognize almost all of the text. By combining the recognition results from different binarization schemes and looking up results in a dictionary, correct recognition results can usually be obtained.

TABLE 2. Recognition results of Scansoft's Omnipage Pro 12.

| FIG. | RECOGNIZED RESULT |
| --- | --- |
| FIG. 6A | )DAY FROM BANK OF AMERICA, MATTEL, MERCK, NETFLIX AND T SCHOOL BOMBING PLOT S C TEEN ACCUSED OF TARGnNG CLASSMATES |
| FIG. 6B | DAY FROM BANK OAMERICA, MATTEL, MERCK, NETFLIX AND ,.: SCHOOL;BOMBING.PI.O ~~~ SaP '8. tom 1s,cr~r~AccusEn o~aa~u~a~a~nss r~s |
| FIG. 6C | )DAY FROM BANK OF AMERICA, MATTEL, MERCK, NETFUX AND 7 SCHOOL BOMBING PLOT ..s;P g,s_(—) St TEENACCUSED OF TARGETING CLASSMATES -.a- In |
| FIG. 6D | )DAY FROM BANK OF AMERICA, MATTEL, MERCK, NETFLIX AND T CNN SCHOOL BOMBING PLOT St TEEN ACCUSED OF TARGETING CLASSMATES |
| FIG. 6E | _DAY FROM BANK OF AMERICA, MATTEL, MERCK, NETFLIX AND.; SCHOOL BOMBING.PLOT~8~ ~QL8At1TEEH~AQCU86D_OF,MRGEM‘Q’ ‘ SNIATEB+ |
| FIG. 6F | )DAY FROM BANK OF AMERICA, MATTEL, MERCK, NETFUX AND T SCHOOL BOMBING PLOT aw 86P g.sa_(y) i SC. IEEN ACCUSED OF ThRGET1NG CLASSMATES 4 r_-I |

As described above, a region-based approach is used to localize text in image or video. Segmentation is used to get regions of different colors. Then features of each region are extracted. The features extracted here are stroke features, edge features and fill factor features. The features are very effective in detecting text. The extracted feature vectors are used to train an SVM model which classifies regions as text or non-text regions. The algorithm was shown to perform very well on both the publicly available data set and other data sets.

Thus, in accord with certain example implementations, a method of text detection in a video image involves, at an image processor, receiving a video frame that potentially contains text; segmenting the image into regions having similar color; identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions; merging remaining regions based on similarity of their size and color and alignment of their horizontal positions; carrying out a feature extraction process to extract stroke features, edge features, and fill factor features on the merged regions; and passing the extracted feature vectors of each region through a trained binary classifier to decide which regions are text and which regions are non-text.

In certain implementations, the method further involves passing the binarized classified text regions through an optical character reader. In certain implementations, segmenting the image into regions of similar color is carried out by determining that the absolute differences of the average red, green, blue (R, G, B) values of two regions are each less than a merging threshold. In certain implementations, the segmenting involves calculating a color difference of neighboring pixels; sorting the pixels according to their color difference; and merging pixels with color difference smaller than a threshold so that regions are generated. In certain implementations, the binary classifier comprises a support vector machine (SVM) based classifier. In certain implementations, stroke widths are considered similar if the stroke width values are within a threshold value. In certain implementations, the stroke width features comprise a feature value representing a percentage of neighborhoods in the image whose standard deviation of stroke width is within a threshold value, wherein the stroke width values are considered similar if they are within the threshold value. In certain implementations, the stroke width features comprise the percentage of the rows whose standard deviation of horizontal stroke width is within a threshold, or which can be clustered into groups where the standard deviation of horizontal stroke width in each group is within a threshold, or the percentage of rows which have strokes with similar stroke widths. In certain implementations, the stroke width feature comprises an average ratio of the current stroke width and the distance of the current stroke to a neighboring stroke. In certain implementations, the stroke width feature comprises a ratio of the two stroke widths that appear the most frequently. In certain implementations, edge features are measurements of the smoothness of edges, the uniformity of edges and the amount of edges in the candidate image, wherein the smoothness of edges is represented by the percentage of neighborhoods that have the same direction, the uniformity of edges is calculated as the frequency of the edge direction that appears the most often, and the amount of edges is measured by the ratio of the length of the total edges to the area of the region. In certain implementations, fill factor features are extracted both in the whole candidate image and neighborhood-wise.

In certain implementations, a preprocessing process operates to determine:

-   (1) if region_height is smaller than some threshold T_low, or the region_height is larger than some threshold T_high, or
-   (2) if region_area is smaller than some threshold T_area, or
-   (3) if the region touches one of the four sides of the image border, and its height is larger than a threshold T, or
-   (4) if a fill_factor defined as

${{fill\_ factor} = \frac{{Region}\mspace{14mu} {Area}}{{Bounding}\mspace{14mu} {Box}\mspace{14mu} {Area}}},$

is lower than a threshold, then the region is considered to be a high likelihood non-text region. In certain implementations, binarization is carried out using a plurality of binarization methods, with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.

Another text detection process consistent with certain implementations involves preprocessing an image by segmentation using statistical region merging, to remove regions that are definitely not text, and grouping regions based on the criteria of height similarity, color similarity, region distance and horizontal alignment defined as follows:

${{{height}\mspace{14mu} {similarity}\mspace{14mu} {is}\mspace{14mu} {defined}\mspace{14mu} {as}\frac{\max \left( {{HEIGHT}_{1},{HEIGHT}_{2}} \right)}{\min \left( {{HEIGHT}_{1},{HEIGHT}_{2}} \right)}} < T_{h\; e\; i\; g\; h\; t\; \_ \; s\; i\; n\; i}},$

where HEIGHT₁ and HEIGHT₂ are the heights of the two regions;

color similarity is defined as

$D\left( {c_{1},c_{2}} \right) = {\sqrt{\left( {\overset{\_}{R_{1}} - \overset{\_}{R_{2}}} \right)^{2} + \left( {\overset{\_}{G_{1}} - \overset{\_}{G_{2}}} \right)^{2} + \left( {\overset{\_}{B_{1}} - \overset{\_}{B_{2}}} \right)^{2}} < T_{color}},$

where c₁=[ R₁ G₁ B₁ ] and c₂=[ R₂ G₂ B₂ ] are the average colors of the two regions;

region distance is defined as D_(region)<T_(region),

where D_(region) is the horizontal distance of the two regions, and

horizontal alignment is defined as D_(top)<T_(align) or D_(bottom)<T_(align), where D_(top) and D_(bottom) are the vertical distances between the top boundaries and the bottom boundaries of the two regions;

carrying out a feature extraction process to describe each remaining region, where each region is represented by a feature vector comprising stroke features, edge features and fill factor features of the region; and

classifying the feature vector by use of a support vector machine (SVM) classifier engine which outputs whether the region is text or not using the following equation:

${sgn}\left( {{\sum\limits_{i = 1}^{l}{y_{i}\alpha_{i}{K\left( {x_{i},x} \right)}}} + b} \right)$

to obtain a classification output where 1 indicates the presence of text and −1 indicates the absence of text.

In certain implementations, stroke features comprise the percentage of vertical neighborhoods and rows that have similar stroke widths. In certain implementations, fill factor features are extracted both in the whole candidate image and neighborhood-wise. In certain implementations, the preprocessing operates to determine:

-   (1) if region height is smaller than some threshold T_low, or the region height is larger than some threshold T_high, or
-   (2) if region area is smaller than some threshold T_area, or
-   (3) if the region touches one of the four sides of the image border, and its height is larger than a threshold T, or
-   (4) if a fill_factor defined as

$\begin{matrix}{{{fill\_ factor} = \frac{{Region}\mspace{14mu} {Area}}{{Bounding}\mspace{14mu} {Box}\mspace{14mu} {Area}}},} & (10)\end{matrix}$

is lower than a threshold, then the region is considered to be a high likelihood non-text region. In certain implementations, binarization is carried out using a plurality of binarization methods, with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.

Those skilled in the art will recognize, upon consideration of the above teachings, that certain of the above exemplary embodiments are based upon use of one or more programmed processors running various software modules that can be arranged as shown in FIG. 1. However, the invention is not limited to such exemplary embodiments, since other embodiments could be implemented using hardware component equivalents such as special purpose hardware and/or dedicated processors or state machines. Similarly, general purpose computers, microprocessor based computers, micro-controllers, optical computers, analog computers, dedicated processors, application specific circuits and/or dedicated hard wired logic may be used to construct alternative equivalent embodiments.

While certain illustrative embodiments have been described, it is evident that many alternatives, modifications, permutations and variations will become apparent to those skilled in the art in light of the foregoing description.

1. A method of text detection in a video image, comprising: at an image processor, receiving a video frame that potentially contains text; segmenting the image into regions having similar color; identifying high likelihood non-text regions from the regions having similar color and discarding the high likelihood non-text regions; merging those regions in the remaining regions whose size and color are similar and whose horizontal positions are within a threshold; describing the regions using features by carrying out a feature extraction process to extract stroke features, edge features, and fill factor features; and passing the remaining regions through a trained binary classifier to obtain the final text regions, which can be binarized and recognized by OCR software.
 2. The method according to claim 1, further comprising passing the binarized highest likelihood text regions through an optical character reader.
 3. The method according to claim 1, wherein segmenting the image into regions of similar color is carried out by determining that the absolute differences of the average red, green, blue (R, G, B) values of two regions are each less than a merging threshold.
 4. The method according to claim 1, wherein the segmenting comprises: calculating a color difference of neighboring pixels; sorting the pixels according to their color difference; and merging pixels with color difference smaller than a threshold so that regions are generated.
 5. The method according to claim 1, wherein the binary classifier comprises a support vector machine (SVM) based classifier.
 6. The method according to claim 1, wherein stroke width values are considered similar if the stroke widths are within a threshold value.
 7. The method according to claim 1, wherein the stroke width features comprise a feature value representing the percentage of neighborhoods in the image whose standard deviation of stroke width is within a threshold value, or the percentage of neighborhoods having similar stroke widths vertically.
 8. The method according to claim 1, wherein the stroke width features comprise a feature value representing the percentage of the rows whose standard deviation of horizontal stroke width is within a threshold, or which can be clustered into groups where the standard deviation of horizontal stroke width in each group is within a threshold, or the percentage of the rows having similar stroke widths or clusters of similar stroke widths.
 9. The method according to claim 1, wherein the stroke width feature comprises an average ratio of the current stroke width and the distance of the current stroke to a neighboring stroke.
 10. The method according to claim 1, wherein the stroke width feature comprises a ratio of the two stroke widths that appear the most frequently.
 11. The method according to claim 1, wherein the edge features are measurements of the smoothness of edges, uniformity of edges and amount of edges in the candidate region, wherein the smoothness of edges is represented by the percentage of neighborhoods that have the same edge direction, the uniformity of edges is calculated as the frequency of the edge direction that appears the most often, and the amount of edges is measured by the ratio of the length of the total edges to the area of the region.
 12. The method according to claim 1, wherein fill factor features are extracted both in the whole candidate image and neighborhood-wise.
 13. The method according to claim 1, wherein regions of high likelihood of being non-text are decided by the following: (1) if region_height is smaller than some threshold T_low, or the region_height is larger than some threshold T_high, or (2) if region_area is smaller than some threshold T_area, or (3) if the region touches one of the four sides of the image border, and its height is larger than a threshold T, or (4) if a fill_factor defined as $\text{fill\_factor} = \frac{\text{Region Area}}{\text{Bounding Box Area}} \qquad (11)$ is lower than a threshold, then a region is considered to be a high likelihood non-text region.
 14. The method according to claim 1, wherein the binarization is carried out using a plurality of binarization methods with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.
 15. A text detection process, comprising: preprocessing an image by segmentation using statistical region merging, removing regions that are definitely not text and grouping regions based on the criteria of height similarity, color similarity, region distance and horizontal alignment defined as follows: height similarity is defined as $\frac{\max\left( HEIGHT_{1}, HEIGHT_{2} \right)}{\min\left( HEIGHT_{1}, HEIGHT_{2} \right)} < T_{height\_sim},$ where HEIGHT₁ and HEIGHT₂ are the heights of the two regions; color similarity is defined as $D(c_{1}, c_{2}) = \sqrt{(R_{1} - R_{2})^{2} + (G_{1} - G_{2})^{2} + (B_{1} - B_{2})^{2}} < T_{color},$ where c₁ = [R₁ G₁ B₁] and c₂ = [R₂ G₂ B₂] are the average colors of the two regions; region distance is defined as $D_{region} < T_{region},$ where $D_{region}$ is the horizontal distance between the two regions, and horizontal alignment is defined as $D_{top} < T_{align}$ or $D_{bottom} < T_{align}$, where $D_{top}$ and $D_{bottom}$ are the vertical distances between the top boundaries and between the bottom boundaries of the two regions; carrying out a feature extraction process to describe each remaining region, where each region is represented by a stroke feature, an edge feature and a fill factor feature of the region; and classifying the feature vector by use of a support vector machine (SVM) classifier engine which outputs whether the region is text or not using the following equation: $\operatorname{sgn}\left( \sum_{i = 1}^{l} y_{i}\alpha_{i} K\left( x_{i}, x \right) + b \right),$ where $(x_{i}, y_{i})$ are the feature vectors and ground-truth labels of the training samples, $x$ is the feature vector of the region to be classified, $\alpha_{i}$ and $b$ are the parameters obtained by solving the optimization problem defined as $\min_{\alpha} \frac{1}{2}\alpha^{T} Q\,\alpha - e^{T}\alpha$ subject to $y^{T}\alpha = 0$, $0 \leq \alpha_{i} \leq C$, $i = 1, \ldots, l$, and $K$ is defined as $K\left( X, X_{j} \right) = \exp\left\{ \frac{-\left\| X - X_{j} \right\|^{2}}{2\sigma^{2}} \right\}$ to obtain a classification output where 1 indicates the presence of text and −1 indicates the absence of text.
 16. The method according to claim 15, wherein fill factor features are extracted both in the whole candidate image and neighborhood-wise.
 17. The method according to claim 15, wherein the preprocessing operates to remove the regions satisfying the following conditions: (1) if region_height is smaller than some threshold T_low, or the region_height is larger than some threshold T_high, or (2) if region_area is smaller than some threshold T_area, or (3) if the region touches one of the four sides of the image border, and its height is larger than a threshold T, or (4) if a fill_factor defined as $\text{fill\_factor} = \frac{\text{Region Area}}{\text{Bounding Box Area}}$ is lower than a threshold, then a region is considered to be a high likelihood non-text region and can be excluded from being further processed.
 18. The method according to claim 15, wherein the binarization is carried out using a plurality of binarization methods with each binarized output being processed by an optical character reader to produce multiple outputs that are combined.
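To make the classifier recited in claim 15 concrete, the following is a non-limiting Python sketch of the decision function $\operatorname{sgn}\left( \sum_{i} y_{i}\alpha_{i} K(x_{i}, x) + b \right)$ with the RBF kernel defined above. The support vectors, labels, multipliers $\alpha_{i}$, bias $b$ and kernel width $\sigma$ are assumed to have been obtained beforehand by solving the stated dual optimization problem (for example, with an off-the-shelf SVM trainer); the example values shown are purely illustrative and are not taken from the description.

```python
import numpy as np


def rbf_kernel(x, x_j, sigma):
    """K(X, X_j) = exp(-||X - X_j||^2 / (2 * sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))


def svm_classify(x, support_vectors, labels, alphas, b, sigma=1.0):
    """Evaluate sgn(sum_i y_i * alpha_i * K(x_i, x) + b).

    Returns +1 (text region) or -1 (non-text region)."""
    decision = sum(
        y_i * a_i * rbf_kernel(x_i, x, sigma)
        for x_i, y_i, a_i in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if decision >= 0 else -1


# Purely illustrative usage with made-up numbers:
if __name__ == "__main__":
    support_vectors = [np.array([0.8, 0.6, 0.7]), np.array([0.1, 0.2, 0.1])]
    labels = [1, -1]     # ground-truth labels y_i of the training samples
    alphas = [0.5, 0.5]  # Lagrange multipliers from the dual problem
    b = 0.0              # bias term
    candidate = np.array([0.75, 0.55, 0.65])  # stroke/edge/fill feature vector
    print(svm_classify(candidate, support_vectors, labels, alphas, b))
```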